import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import numpy as np
In this notebook, we will attempt to estimate the error in the UN City Population Predictions for African cities. The predictions are taken from the 2018 Revision of World Urbanization Prospects; they were made in 2018 and cover city populations from 2019 to 2035. To test these predictions, we sourced more recent population estimates that were made after 2018.
In a previous notebook, we merged the UN population predictions with the more recent city population estimates that we found. For example, if we found a census population estimate for Nairobi in 2021, we matched it with the 2021 UN population prediction for Nairobi.
Below we load in this data:
cities = pd.read_csv('UN_projections_compared.csv')
cities.head()
| | Region | Country_Code | Country_or_area | City_Code | City | data_sources_UN | City_Definition | Latitude | Longitude | year | ... | city_cleaned | UN_proj_error | UN_proj_error_% | mag_UN_proj_error | mag_UN_proj_error_% | latest_census_data_UN | latest_estimate_data_UN | most_recent_UN_source | years_since_latest_UN_source | data_source_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Africa | 854 | burkina faso | 23191 | bobo-dioulasso | Censuses of 1951, 1960, 1975, 1985, 1996 and 2... | City Proper | 11.17715 | -4.29790 | 2019 | ... | bobo-dioulasso | 19450.0 | 0.021041 | 19450.0 | 0.021041 | 2006 | 2015 | Estimate | 3 | Population Census |
| 1 | Africa | 854 | burkina faso | 23192 | ouagadougou | Censuses of 1951, 1960, 1975, 1985, 1996 and 2... | City Proper | 12.36423 | -1.53834 | 2019 | ... | ouagadougou | 237454.0 | 0.089513 | 237454.0 | 0.089513 | 2006 | 2015 | Estimate | 3 | Population Census |
| 2 | Africa | 140 | central african republic | 20410 | bangui | Censuses of 1955, 1964, 1975, 1988 and 2003; E... | Urban Agglomeration | 4.36122 | 18.55496 | 2023 | ... | bangui | -601665.0 | -0.627823 | 601665.0 | 0.627823 | 2003 | 1966 | Census | 15 | Population Estimate |
| 3 | Africa | 178 | congo | 20848 | brazzaville | Censuses of 1955, 1961, 1974, 1984, 1996 and 2... | City Proper | -4.26583 | 15.28318 | 2019 | ... | brazzaville | 326076.0 | 0.141276 | 326076.0 | 0.141276 | 2007 | 1950 | Census | 11 | Population Projection |
| 4 | Africa | 178 | congo | 20849 | pointe-noire | Censuses of 1958, 1974, 1984, 1996 and 2007; S... | City Proper | -4.77609 | 11.86352 | 2019 | ... | pointe-noire | 143513.0 | 0.122085 | 143513.0 | 0.122085 | 2007 | 1962 | Census | 11 | Population Projection |
5 rows × 33 columns
The key variable we will focus on is UN_proj_error_%, defined mathematically below:
$UN\_proj\_error\_\% = \frac{population\_UN - population\_measured}{population\_UN}$
This expresses the prediction error (the difference between the UN population prediction and the measured population) as a fraction of the original UN prediction. For example, if the UN predicted a population of 1,000 and the measured population was 500, the error is 1,000 - 500 = 500. Dividing by the original UN prediction of 1,000 gives a UN_proj_error_% of 0.5, or 50%.
A positive value means the UN over-estimated the population and a negative value means the UN under-estimated the population.
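The worked example above can be written directly as code (toy numbers from the text, not values from the dataset):

```python
# Toy numbers from the worked example above
population_UN = 1_000        # the UN prediction
population_measured = 500    # the more recent measured population

# Error expressed as a fraction of the UN prediction
UN_proj_error_pct = (population_UN - population_measured) / population_UN
print(UN_proj_error_pct)  # 0.5, i.e. the UN over-estimated by 50%
```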
We decided to use this rather than the raw error value (the difference between the UN predicted population and the measured city population) because we strongly suspect the error will be proportional to the size of the UN Population estimate (e.g. a projection of 1,000,000 will likely have a larger typical raw error figure than a projection of 400,000).
By finding an error value that is expressed as a fraction of the UN population estimate, we remove the variation in error due to the size of the UN population. This way, we can hopefully find a typical figure e.g. 15% that all UN predictions vary according to, regardless of how large or small they are.
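One quick numeric check of this proportionality assumption is to correlate the UN prediction with the magnitude of the raw error. A minimal sketch on made-up rows (the real analysis would use the `population_UN` and `mag_UN_proj_error` columns of the `cities` frame):

```python
import pandas as pd

# Toy stand-in for the real `cities` frame; column names match the dataset
toy = pd.DataFrame({
    'population_UN': [400_000, 1_000_000, 2_500_000],
    'mag_UN_proj_error': [40_000, 120_000, 300_000],
})

# If the raw error grows with the size of the prediction, the two
# columns will correlate strongly
corr = toy['population_UN'].corr(toy['mag_UN_proj_error'])
print(corr)
```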
Below, we create 2 plots to validate our assumptions:
fig1 = px.scatter(cities, y = "mag_UN_proj_error",
x = 'population_UN', hover_name='City',
range_x=(0, 6000000), title = 'Raw Error vs Projected UN Population')
fig1.show()
# Note we used the magnitude of the raw error, because we are not
# interested in whether the predictions are over- or under-estimates,
# only whether the size of the error correlates with the size of the
# predicted population
fig2 = px.scatter(cities, y = "mag_UN_proj_error_%", x = 'population_UN',
hover_name='City', range_x=(0, 6000000),
title = 'Error as Fraction of UN Population vs Projected UN Population')
fig2.show()
Observations:
The measured city data was sourced from https://citypopulation.de/, which provides 3 types of population figure (defined here): Population Census, Population Estimate and Population Projection.
We believe the Population Census figures are typically reliable and treat them as a source of truth that we can compare directly to the UN population predictions. We are less sure about the Population Estimate and Population Projection figures. Since we have relatively few census population figures to work with (only a small number of countries have held a census since 2018), we think it is worth including these other, possibly less reliable, figures.
We will start our analysis by comparing the 3 different estimate types to see if they show significantly different error trends. If they do, it may imply we should be careful when using them to estimate the accuracy of the UN projections.
Below, we quantify the number of observations we have within each of the 3 city population estimate categories:
fig3 = px.histogram(cities, x = 'data_source_type',
title = 'Count of Population Figures per Estimate Type',
color = 'data_source_type')
fig3.show()
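The same counts can also be read off numerically with `value_counts`; a sketch on a toy frame standing in for `cities`:

```python
import pandas as pd

# Toy stand-in for the real `cities` frame
toy = pd.DataFrame({'data_source_type': ['Population Census',
                                         'Population Census',
                                         'Population Estimate',
                                         'Population Projection']})

# Numeric counterpart of the histogram: observations per estimate type
counts = toy['data_source_type'].value_counts()
print(counts)
```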
Observations:
We now want to identify whether the different estimate types show similar or different error trends. First, we plot a histogram of the UN_proj_error_%, coloring each estimate type separately:
fig4 = px.histogram(cities, x="UN_proj_error_%",
color = 'data_source_type',
title = 'Distribution of Projection Error per Population Estimate Type')
fig4.show()
Observations:
We will now look at a box plot of absolute error values (no negatives) to view the data in a different way:
fig5 = px.box(cities, y="mag_UN_proj_error_%",
color = 'data_source_type',
title = 'Projection Error Box Plots for Different Population Estimate Types')
fig5.show()
Observations:
Finally, we will calculate some summary statistics for each estimate type and see what conclusions we can draw from them:
metrics = ['mean', 'median', 'std', 'max', 'min']
cities.groupby('data_source_type')['UN_proj_error_%'].agg(metrics)
| data_source_type | mean | median | std | max | min |
|---|---|---|---|---|---|
| Population Census | -0.032778 | 0.021041 | 0.234340 | 0.507310 | -0.531148 |
| Population Estimate | -0.079687 | -0.084522 | 0.184166 | 0.292731 | -0.627823 |
| Population Projection | 0.073150 | 0.084221 | 0.127437 | 0.261639 | -0.179335 |
Observations:
Caution - the small mean and median values for Population Census are not measures of typical accuracy. Because the negative and positive errors cancel each other out, the signed mean is a measure of balance, not of error size. Performing the same calculation on the magnitude of the error yields very different results.
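A toy illustration of this cancellation effect (the numbers are made up):

```python
import numpy as np

# Made-up signed errors that roughly cancel out
errors = np.array([0.5, -0.53, 0.02, -0.03])

signed_mean = errors.mean()        # near zero: a measure of balance, not accuracy
abs_mean = np.abs(errors).mean()   # the typical size of the error
print(signed_mean, abs_mean)
```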
Conclusions:
In this section, we will try to estimate the typical projection error in a UN population prediction. This typical error will be used by city planners to help them understand how reliable an estimate is likely to be. As such, we will be looking to create a conservative estimate that captures the largest likely error. This will hopefully allow city planners to bear in mind and prepare for worst case large errors in the prediction which could cause major problems in city planning.
This means we will be looking for the largest error boundary that captures the majority of population predictions. It will not, however, include outliers. We have explored 2 outliers to understand how they may be interpreted. In one case (Freetown in Sierra Leone) the census data has been called into question because it showed the population halving since the previous census. In other cases, it seems the UN data was simply quite wrong. Either way, the typical error will be a measure of the maximum width of typical (i.e. the most common) projection errors.
We will create a rough estimate of the typical error by calculating the same statistical metrics above but this time on the magnitude of the error:
cities.groupby('data_source_type')['mag_UN_proj_error_%'].agg(metrics)
| data_source_type | mean | median | std | max | min |
|---|---|---|---|---|---|
| Population Census | 0.174853 | 0.150237 | 0.154198 | 0.531148 | 0.021041 |
| Population Estimate | 0.143775 | 0.109804 | 0.137640 | 0.627823 | 0.000153 |
| Population Projection | 0.117446 | 0.115836 | 0.083847 | 0.261639 | 0.024128 |
Observations:
The mean and median are only rough estimates of the typical error. They do not tell us what proportion of projections fall within a given threshold (e.g. the maximum error percentage for 80% of UN projections).
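This kind of threshold can also be computed directly with quantiles; a sketch on toy error magnitudes (the real analysis would use `cities['mag_UN_proj_error_%']`):

```python
import pandas as pd

# Toy error magnitudes standing in for cities['mag_UN_proj_error_%']
mag_errors = pd.Series([0.02, 0.05, 0.10, 0.12, 0.15,
                        0.20, 0.28, 0.35, 0.45, 0.63])

# The error threshold that captures 80% of the projections
q80 = mag_errors.quantile(0.8)
print(q80)
```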
To get this information, we plot the cumulative distribution of the magnitude of the error below:
fig6 = px.histogram(cities, x = 'mag_UN_proj_error_%', cumulative=True,
histnorm='percent', height = 900,
facet_row='data_source_type',
title = 'Cumulative Distribution of Projection Error per Population Estimate Type',
color = 'data_source_type')
fig6.show()
Observations:
Conclusion:
In the plots below, we have tried to convey the error in the UN population predictions for every individual prediction.
We will first plot the UN Population Prediction against the Measured Population for every single prediction in the dataset.
If the UN perfectly predicted the population, then the predictions would sit along the y = x straight line from the origin (plotted in blue). The amount a prediction diverges from this straight line is a measure of the error in the prediction. If the prediction over-estimates the population, it will sit above the straight line. If the prediction under-estimates the population, it will sit below it.
We have shaded in the ±30% UN_proj_error_% boundary above and below the straight line; it captures the majority of the observations.
Please note - the ±30% boundary is not symmetric across the y = x line. This may seem wrong and unintuitive at first, but we are confident it is correct.
This behaviour is due to the way the percentage error is defined. Recall that the percentage error was defined as the difference between the projected population and the measured population divided by the projected population:
$UN\_proj\_error\_\% = \frac{population\_UN - population\_measured}{population\_UN}$
Dividing by the projected population is what causes this behaviour. If we move vertically up from the y = x line, the Projected Population keeps increasing. This means that you need a larger raw error (vertical distance from y = x) to get to 30% (because you are dividing by an ever larger number to get the percentage). As you move vertically down from y = x, the projected population is decreasing and thus you are dividing by an ever smaller number. Thus, you reach the raw error required for a 30% threshold faster.
If you look carefully, you will see that the horizontal distance between y = x and the boundary is always the same on either side. This is because a horizontal slice represents the population projection staying constant, and thus it takes the same size of raw error to reach 30% on either side.
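This horizontal symmetry can be checked numerically; a small sketch with toy figures (the population value is illustrative, not from the dataset):

```python
# Numeric check of the horizontal symmetry claim
error = 0.3
population_UN = 2_000_000  # fix a horizontal slice (constant y)

# Measured populations at which this prediction hits +30% / -30% error
measured_at_plus30 = population_UN * (1 - error)   # over-estimate boundary
measured_at_minus30 = population_UN * (1 + error)  # under-estimate boundary

# The horizontal distance from the y = x line is the same on both sides
print(population_UN - measured_at_plus30)
print(measured_at_minus30 - population_UN)
```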
The equations for the ±30% boundary lines are derived in the Appendix.
# Define data that will be available when you hover over a data point
hover_data = ['Country_or_area', 'year', 'population_UN', 'population',
'UN_proj_error', 'UN_proj_error_%','data_source_type',
'years_since_UN_projection', 'months_offset_from_midyear',
'most_recent_UN_source', 'years_since_latest_UN_source']
# Updated html code of the hover info to make it more readable and
# change the order.
hover_template = '<b>%{hovertext}</b> - %{customdata[0]} (%{customdata[1]})<br><br>population_UN=%{y}<br>population=%{x}<br>UN_proj_error=%{customdata[2]}<br>UN_proj_error_%=%{customdata[3]}<br>data_source_type=%{customdata[4]}<br>years_since_UN_projection=%{customdata[5]}<br>months_offset_from_midyear=%{customdata[6]}<br>most_recent_UN_source=%{customdata[7]}<br>years_since_latest_UN_source=%{customdata[8]}<extra></extra>'
hover_template = hover_template.replace('=', ' = ')
labels = {'population_UN':'UN Predicted Population',
'population':'Measured Population',
'data_source_type':'Measured Population Estimate Type'}
# Create scatter plot
fig7 = px.scatter(cities, x = "population", y = 'population_UN',
color = 'data_source_type', height = 600,
hover_name = 'City', hover_data = hover_data,
title = 'UN Predicted Population vs Measured Population',
labels = labels)
# Update hover info background color and font
fig7.update_layout(
hoverlabel=dict(
bgcolor="white",
font_size=10,
font_family="Rockwell"
)
)
# Update html of hover data in plot
for data in fig7.data:
data.hovertemplate = hover_template
# calculate maximum population that will be plotted
pop_max = cities.population.max()
extra = 2500000
# plot y = x line from 0 to max_pop_plotted which is a bit beyond the
# largest population that will be plotted.
max_pop_plotted = pop_max + extra
# Define the error boundary we will plot (e.g. set as 0.3 to plot
# a +-30% error boundary)
error = 0.3
# We will plot straight lines above and below the y = x straight line.
# These lines will show the position of the +- 30% error region (anything
# outside these lines has an error larger than 30%). They will both go
# through the origin. We need to define the position of the rightmost point
# on each line (the end of the line). Plotly will connect this point to the
# origin to draw the line.
# the rightmost point must be expressed in terms of x and y, which in this
# case means Population_UN and population. pop_max defines the x coordinate
# of the max point. The below equations define the y coordinate (population_UN)
# of the rightmost point for the 2 lines (pos for above and neg for below y = x).
max_plotted_error_pos = max_pop_plotted / (1 - error)
max_plotted_error_neg = max_pop_plotted / (1 + error)
# define line style for the +-30% boundary lines (it will be invisible)
invisible_line = go.scatter.Line(color='rgb(0,0,0)', width=0)
# define line style for the y = x line (it will be blue)
visible_line = go.scatter.Line(color='blue')
# create marker object to hide the markers at the ends of the lines
marker = go.scatter.Marker(opacity=0)
# Define the transparent color of the +-30% error region
fillcolor = 'rgba(0,50,0,0.1)'
# add invisible +30% error region line, not shown in legend
fig7.add_trace(
go.Scatter(x=[0, max_pop_plotted], y=[0,max_plotted_error_pos],
name="error_over", line_shape='linear', opacity = 0,
line = invisible_line, marker = marker, showlegend=False)
)
# add invisible y = x line, and fill between this line and the +30% line above
fig7.add_trace(
go.Scatter(x=[0, max_pop_plotted], y=[0,max_pop_plotted],
name="+30% Prediction Error (over-estimate)",
line_shape='linear', line= invisible_line, fill='tonexty',
opacity = 0.01, marker = marker, fillcolor = fillcolor)
)
# Add invisible -30% error region line, not shown in legend
fig7.add_trace(
go.Scatter(x=[0, max_pop_plotted], y=[0,max_plotted_error_neg],
name="error_under", line_shape='linear', opacity = 0.01,
line=invisible_line,marker = marker, showlegend=False)
)
# add invisible y = x line, and fill between this line and the -30% line below
fig7.add_trace(
go.Scatter(x=[0, max_pop_plotted], y=[0,max_pop_plotted],
name="-30% Prediction Error (under-estimate)",
fillcolor= fillcolor, line_shape='linear', line= invisible_line,
fill='tonexty' ,opacity = 0.01,marker = marker)
)
# add visible blue y = x line
fig7.add_trace(
go.Scatter(x=[0, max_pop_plotted], y=[0,max_pop_plotted],
name="Perfect Population Prediction", fillcolor= fillcolor,
line_shape='linear', line= visible_line ,marker = marker)
)
fig7.show()
Observations:
Below, we plot the projection error directly against the predicted population. The x axis uses a log scale to spread out the points and make the plot more readable (otherwise they are concentrated towards the left side of the chart and are hard to read). This is a simpler plot that easily and intuitively conveys the ±30% error boundaries. The 20% and 30% error boundaries are indicated via highlights. A marginal violin plot on the right-hand side provides a simplified view of the distribution of the errors and shows how the 20% and 30% boundaries capture different amounts of the projections we have tested:
labels = {'population_UN':'UN Predicted Population',
'UN_proj_error_%':'Error In Prediction',
'data_source_type':'Measured Population Estimate Type'}
# Updated html code of the hover info to make it more readable and
# change the order.
hover_template = '<b>%{hovertext}</b> - %{customdata[0]} (%{customdata[1]})<br><br>population_UN=%{x}<br>population=%{customdata[2]}<br>UN_proj_error=%{customdata[3]}<br>UN_proj_error_%=%{y}<br>data_source_type=%{customdata[4]}<br>years_since_UN_projection=%{customdata[5]}<br>months_offset_from_midyear=%{customdata[6]}<br>most_recent_UN_source=%{customdata[7]}<br>years_since_latest_UN_source=%{customdata[8]}<extra></extra>'
hover_template = hover_template.replace('=', ' = ')
# Create scatter plot
fig8 = px.scatter(cities, y="UN_proj_error_%", x = 'population_UN',
color = 'data_source_type', marginal_y = 'violin',
log_x=True, height = 600, hover_name = 'City',
hover_data = hover_data, labels = labels,
title = 'Projection Error vs Population Prediction')
# Set y axis to show percentage e.g. 0.3 becomes 30%
fig8.layout.yaxis.tickformat = ',.0%'
# Update hover info background color and font
fig8.update_layout(
hoverlabel=dict(
bgcolor="white",
font_size=10,
font_family="Rockwell"
)
)
# Update html of hover data in plot
for data in fig8.data:
data.hovertemplate = hover_template
# Add the +-30% error region as rectangle shading
fig8.add_hrect(y0=-0.3, y1=0.3, fillcolor="green", opacity=0.25,
line_width=0,
annotation_text="<b>20% - 30% </b>Error Region",
annotation_position="inside bottom right",
annotation=dict(font_size=12,
font_family="Rockwell", font_color = 'green'))
# Add the +-20% error region as rectangle shading
fig8.add_hrect(y0=-0.2, y1=0.2, fillcolor="orange", opacity=0.25,
line_width=0,
annotation_text="<b>20%</b> Error Region",
annotation_position="inside top right",
annotation=dict(font_size=12, font_family="Rockwell",
font_color = 'yellow'))
# Remove annotation text from marginal plot
fig8.layout.annotations = (fig8.layout.annotations[0],
fig8.layout.annotations[2])
fig8.show()
Observations:
# Note: use a new figure name here so we do not overwrite fig1 above
fig9 = px.box(cities.sort_values('years_since_UN_projection'),
y="UN_proj_error_%",
facet_row = 'City_Definition', height = 1000)
fig9.show()
fig7.write_html('UN Population Prediction vs Measured Population.html')
fig8.write_html('Projection Error vs Population Prediction.html')
Save the Plotly data so the interactive plots work on NBViewer:
# make notebook interactive on nb-viewer
import plotly
plotly.offline.init_notebook_mode()
Here we document the derivation of the two straight-line equations that define the ±30% error boundary on the UN Predicted Population vs Measured Population plot:
Recall that the percentage error was defined as the difference between the projected population and the measured population divided by the projected population:
$UN\_proj\_error\_\% = \frac{population\_UN - population\_measured}{population\_UN}$
We are looking for the ±30% error boundary, so we set UN_proj_error_% to ±0.3:
$\pm0.3 = \frac{population\_UN - population\_measured}{population\_UN} = 1 - \frac{population\_measured}{population\_UN}$
We now re-arrange this into the straight-line form $y = mx$, expressing the y axis (population_UN) in terms of the x axis (population_measured) multiplied by a gradient term $m$:
$population\_UN_{\pm 30\%} = \frac{population\_measured}{1 \mp 0.3}$
Note the sign flip in the denominator: the +30% (over-estimate) line is $population\_measured / 0.7$, and the -30% (under-estimate) line is $population\_measured / 1.3$.
Any point on either of these lines has a combination of Population_UN and population_measured that results in a 30% error (either above or below the 0% error population_UN = population_measured line).
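As a quick sanity check of the boundary equation, a sketch with toy numbers confirming that a point on the +30% line really does yield a 30% error:

```python
error = 0.3
population_measured = 700_000
# A point on the +30% (over-estimate) boundary line
population_UN = population_measured / (1 - error)

# Recompute the error from its definition
computed_error = (population_UN - population_measured) / population_UN
print(computed_error)
```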